[tune] Better error message for Tune nested tasks / actors #25241
Conversation
python/ray/remote_function.py
Outdated
raise AbortTrialExecution(
    f"Your trial launched the Ray task `{name}`, but no extra CPUs were "
    "reserved for task execution. Specify the Tune argument "
    '`per_trial_task_resources={"cpu": N}` to reserve CPUs for task execution.')
The user could also specify the placement group factory.
We should settle on whether we want PGF to be an end-user API. It's pretty unintuitive IMO.
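For reference, specifying the placement group factory directly looks roughly like the sketch below (the trainable name and bundle sizes are assumptions; whether to point end users at this API is exactly the open question above):

from ray import tune

def my_trainable(config):  # placeholder trainable, for illustration only
    tune.report(score=0)

# The first bundle is reserved for the trainable itself; the second reserves
# a CPU for a nested task or actor the trial may launch.
tune.run(
    my_trainable,
    resources_per_trial=tune.PlacementGroupFactory([{"CPU": 1}, {"CPU": 1}]),
)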
    ],
)
ray.get(placement_group.ready())
# Tune launches its trainable functions in placement groups.
Hmmm, would this be a bit of a jump for users as to why we are suddenly talking about using datasets in the context of Tune? Should we add some context about why this is an important concept to understand, e.g. resource contention, hanging, etc.?
I updated the text around this section to clarify this.
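As a minimal illustration of the contention concern (a standalone sketch with assumed cluster size, not the docs snippet under review): once a trial's placement group has reserved all available CPUs, nested work has nowhere to run unless extra bundles are reserved.

import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=1)

# Roughly what Tune does per trial: reserve the trial's resources up front.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

# The only CPU is now held by the placement group, so any additional work
# that needs a CPU (and does not fit inside the group) stays pending; this
# is the resource contention / hanging concern mentioned above.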
# Setup hooks for generating placement group resource deadlock warnings.
from ray import actor, remote_function

if "TUNE_DISABLE_RESOURCE_CHECKS" not in os.environ:
Do we want to add this to doc/source/tune/api_docs/env.rst? Or do we intend this to be more of an internal feature flag?
Would be great if you could add this to the docs.

Also, to follow the convention with other Tune env variables:

    - if "TUNE_DISABLE_RESOURCE_CHECKS" not in os.environ:
    + if os.environ.get("TUNE_DISABLE_RESOURCE_CHECKS", "0") != "1":
I'd like to keep this as a hidden/internal flag only, not documented. It should only be used for internal debugging.
Ok with me
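For completeness, using the flag for internal debugging would then just be a matter of setting the variable before Tune installs its hooks (a sketch under the suggested convention; where exactly the guard runs is an assumption):

import os

# Internal debugging only, per the discussion above: with the suggested
# convention the checks are skipped only when the value is exactly "1".
os.environ["TUNE_DISABLE_RESOURCE_CHECKS"] = "1"

# Importing Tune after setting the flag covers the case where the guarded
# hook setup runs at import time.
from ray import tune  # noqa: E402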
)


def test_scheduling_strategy_override():
Hmmm I got lost here. Can you add a few comments about what this is trying to capture?
Done.
python/ray/tune/session.py
Outdated
resources = {k: float(v) for k, v in resources.items() if v > 0}

raise TuneError(
    f"No trial resources are available for executing the task `{name}`. "
launching the {submitted}: {name}?
Fixed.
This is great, thanks. Only minor nits/suggestions.
python/ray/tune/session.py
Outdated
f"> [{main_resources}] + [{resources}] * N\n" | ||
"> )\n\n" | ||
f"Where `N` is the number of slots to reserve for trial {submitted}s. " | ||
"For more information, refer to: https://docs.ray.io/en/latest/tune/" |
Suggestion for wording:

    - "For more information, refer to: https://docs.ray.io/en/latest/tune/"
    + "If you are using a Ray training library, there might be a utility "
    + "function to set this automatically for you. "
    + "For more information, refer to: https://docs.ray.io/en/latest/tune/"
Fixed.
Co-authored-by: xwjiang2010 <[email protected]>
Thanks!
@@ -248,7 +248,7 @@ def read_datasource(
     )

     if len(read_tasks) < parallelism and (
-        len(read_tasks) < ray.available_resources().get("CPU", parallelism) // 2
+        len(read_tasks) < ray.available_resources().get("CPU", 1) // 2
If the CPU count is 0 or missing, fall back to 1 instead of parallelism, which would be too large.
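To make the effect of the fallback concrete (illustrative numbers only, not from the PR):

available = {"GPU": 2}   # a cluster reporting no "CPU" resource
parallelism = 200

# Old default: a missing "CPU" key falls back to `parallelism`, so the
# threshold becomes 200 // 2 == 100 and the check can trigger spuriously.
old_threshold = available.get("CPU", parallelism) // 2  # -> 100

# New default: a missing "CPU" key falls back to 1, giving 1 // 2 == 0.
new_threshold = available.get("CPU", 1) // 2  # -> 0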
Why are these changes needed?
This PR uses a task/actor launch hook to generate better error messages for nested Tune tasks/actors in the case where no extra resources are reserved for them. The idea is that the Tune trial runner actor can set a hook prior to executing the user code. If the user code launches a task, and the placement group for the trial cannot possibly fit the task, then we raise TuneError right away to warn the user. This isn't perfect, but it should catch the most common use cases, and we can improve the docs as well to help cover this footgun.
Sample code:
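A minimal illustration of the kind of code that hits the new check (function names and the exact run arguments here are assumptions, not the PR's actual sample):

import ray
from ray import tune

@ray.remote
def nested_task():
    return 1

def trainable(config):
    # Launches a Ray task from inside the trial. With the default
    # resources_per_trial (one CPU for the trainable and nothing extra),
    # the trial's placement group cannot fit this task, so the new hook
    # raises a TuneError instead of letting the trial hang.
    tune.report(result=ray.get(nested_task.remote()))

tune.run(trainable)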
Sample output:
Related issue number
Closes #25239